Abstract

This paper presents two algorithms, based on the horizontal and
vertical pattern discovery paradigms, that find the connected
subgraphs that have a sufficient number of edge-disjoint embeddings
in a single large undirected labeled sparse graph. These algorithms
use three different methods to determine the number of edge-disjoint
embeddings of a subgraph, based on approximate and exact maximum
independent set computations, and use that number to prune infrequent
subgraphs. Experimental evaluation on real datasets from various
domains shows that both algorithms achieve good performance, scale
well to sparse input graphs with more than 100,000 vertices and
around 200,000 edges, and significantly outperform previously
developed algorithms.

Keywords: pattern discovery, frequent subgraph, graph mining.

1  Introduction 

Data  mining  is  the  process  of  automatically  extracting  new 
and  useful  knowledge  hidden  in  large  datasets.  This  emerging 
discipline  is  becoming  increasingly  important  as  advances  in 
data  collection  have  led  to  the  explosive  growth  in  the  amount 
of  available  data. 

In recent years, there has been an increased interest in developing
data mining algorithms that operate on graphs. Such graphs arise
naturally in a number of different application domains including
network intrusion [47, 41], semantic web [4], behavioral modeling
[67, 55], VLSI reverse engineering [70], link analysis [34, 40, 39,
58], and chemical compound classification [14, 43, 22, 16]. Moreover,
they can be used to effectively model the structural and relational
characteristics of a variety of datasets arising in other areas such
as physical sciences (e.g., chemistry, fluid dynamics, astronomy,
structural mechanics, and ecosystem modeling), life sciences (e.g.,
genomics, proteomics, pharmacogenomics, and health informatics), and
homeland defense (e.g., information assurance, network intrusion,
infrastructure protection, and terrorist-threat
prediction/identification).

*This work was supported in part by NSF CCR-9972519, EIA-9986042,
ACI-9982274, ACI-0133464, and ACI-0312828; the Digital Technology
Center at the University of Minnesota; and by the Army High
Performance Computing Research Center (AHPCRC) under the auspices of
the Department of the Army, Army Research Laboratory (ARL) under
Cooperative Agreement number DAAD19-01-2-0014. The content of this
work does not necessarily reflect the position or the policy of the
government, and no official endorsement should be inferred. Access to
research and computing facilities was provided by the Digital
Technology Center and the Minnesota Supercomputing Institute.

The focus of this paper is on developing algorithms for a particular
data mining task, namely finding frequently occurring patterns in
graph datasets. Frequent patterns play a critical role in many data
mining tasks, as they can be used, among other things, to derive
association rules [1], act as composite features for classification
algorithms [14, 56, 63, 51, 22, 50, 15], cluster the (graph)
transactions [1, 48, 35, 36, 49, 24], and help in determining the
similarity between graphs [54, 23, 42, 59, 9, 49, 13, 60, 66]. Within
the context of graphs, the most widely used definition of a pattern
is that of a connected subgraph [8, 68, 32, 29, 69, 30, 44], and this
is the definition that we will use in this paper. However, different
pattern definitions have been proposed as well [32].

There are two distinct problem formulations for frequent pattern
mining in graph datasets that are referred to as the
graph-transaction setting and the single-graph setting. In the
graph-transaction setting, the input to the pattern mining algorithm
is a set of relatively small graphs (called transactions), whereas in
the single-graph setting the input data is a single large graph. The
difference affects the way the frequency of the various patterns is
determined. For the graph-transaction setting, the frequency of a
pattern is determined by the number of graph transactions that the
pattern occurs in, irrespective of how many times a pattern occurs in
a particular transaction, whereas in the single-graph setting, the
frequency of a pattern is based on the number of its occurrences
(i.e., embeddings) in the single graph. Due to the inherent
differences in the characteristics of the underlying dataset and the
problem formulation, algorithms developed for the graph-transaction
setting cannot be used to solve the single-graph setting, whereas the
latter algorithms can be easily adapted to solve the former problem.

In recent years, a number of efficient and scalable algorithms have
been developed to find patterns in the graph-transaction setting [8,
68, 32, 29, 69, 30, 44]. These algorithms are complete in the sense
that they are guaranteed to discover all frequent subgraphs, and they
were shown to scale to very large graph datasets. However, developing
algorithms that are capable of finding patterns in the single-graph
setting has received much less attention, despite the fact that this
problem setting is more generic and applicable to a wider range of
datasets and application domains than the other. Moreover, existing
algorithms, whether they are guaranteed to find all frequent patterns
[21, 65] or are heuristic, such as GBI [71] and SUBDUE [28], which
tend to miss a large number of frequent patterns, are computationally
expensive and do not scale to large datasets.

Developing algorithms that find the complete set of frequent patterns
in the single-graph setting is the focus of this paper. We present
two computationally efficient algorithms that can find subgraphs
which are frequently embedded within a large sparse graph. The first
algorithm, called hSiGraM, follows a horizontal approach and finds
the frequent subgraphs in a breadth-first fashion, whereas the second
algorithm, called vSiGraM, follows a vertical approach and finds the
frequent subgraphs in a depth-first fashion. These algorithms
incorporate efficient methods for candidate generation and frequency
counting that allow them to scale to graphs containing over 100,000
vertices and find patterns with relatively low occurrence frequency.
Our experimental evaluation on six real graphs shows that both
hSiGraM and vSiGraM achieve reasonably good performance, scale to
large graphs, and substantially outperform previously developed
approaches for solving similar or simpler versions of the problem.

The rest of this paper is organized as follows. Section 2 defines the
graph model that we use, reviews some graph-related definitions, and
introduces the notation that is used in the paper. Section 3 surveys
related research in this area. Section 4 formally defines the problem
of frequent subgraph discovery and discusses the challenges
associated with finding frequent subgraphs in a computationally
efficient manner. Section 5 describes in detail the hSiGraM and
vSiGraM algorithms that we developed for solving the problem of
frequent subgraph discovery from a single large sparse graph.
Section 6 provides a detailed experimental evaluation of the hSiGraM
and vSiGraM algorithms on various real datasets and compares them
against existing algorithms. Finally, Section 7 provides some
concluding remarks.


2  Definitions  and  Notation 

A graph $G = (V, E)$ is made of two sets, the set of vertices $V$ and
the set of edges $E$. Each edge itself is a pair of vertices, and
throughout this paper we assume that the graph is undirected, i.e.,
each edge is an unordered pair of vertices. Furthermore, we will
assume that the graph is labeled. That is, each vertex and edge has a
label associated with it that is drawn from a predefined set of
vertex labels ($L_V$) and edge labels ($L_E$). Each vertex (or edge)
of the graph is not required to have a unique label, and the same
label can be assigned to many vertices (or edges) in the same graph.
If all the vertices and edges of the graph have the same vertex and
edge label assigned to them, we will call this graph unlabeled.

Given a graph $G = (V, E)$, a graph $G_s = (V_s, E_s)$ is a subgraph
of $G$ if and only if $V_s \subseteq V$ and $E_s \subseteq E$. A
graph is connected if there is a path between every pair of vertices
in the graph. Two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$
are isomorphic if they are topologically identical to each other,
that is, there is a mapping from $V_1$ to $V_2$ such that each edge
in $E_1$ is mapped to a single edge in $E_2$ and vice versa. In the
case of labeled graphs, this mapping must also preserve the labels on
the vertices and edges. An automorphism is an isomorphism mapping
where $G_1 = G_2$. Given two graphs $G_1 = (V_1, E_1)$ and
$G_2 = (V_2, E_2)$, the problem of subgraph isomorphism is to find an
isomorphism between $G_2$ and a subgraph of $G_1$, i.e., to determine
whether or not $G_2$ is included in $G_1$.

Given a subgraph $G_s$ and a graph Q, two embeddings of $G_s$ in Q
are called identical if they use the same set of edges of Q, and they
are called edge-disjoint if they do not have any edges of Q in
common. Given the set of all embeddings of a particular subgraph
$G_s$ in a graph Q, the overlap graph of $G_s$ is a graph obtained by
creating a vertex for each non-identical embedding and creating an
edge for each pair of non-edge-disjoint embeddings. An example of a
subgraph and its overlap graph is shown in Figure 2.

The notation that we will be using throughout the paper is shown in
Table 1.
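
The construction of the overlap graph can be sketched in a few lines of Python. This is an illustration only, not the implementation used in the paper: the function name `build_overlap_graph`, the representation of an embedding as a list of vertex pairs, and the toy path graph are all assumptions made for the example.

```python
from itertools import combinations


def build_overlap_graph(embeddings):
    """Return (unique_embeddings, overlap_graph).

    Each embedding is an iterable of undirected edges (u, v) of the input
    graph. The overlap graph maps the index of every non-identical
    embedding to the indices of the embeddings it shares an edge with.
    """
    # Identical embeddings use the same edge set, so keep one copy of each.
    unique, seen = [], set()
    for emb in embeddings:
        edge_set = frozenset(frozenset(edge) for edge in emb)
        if edge_set not in seen:
            seen.add(edge_set)
            unique.append(edge_set)

    overlap = {i: set() for i in range(len(unique))}
    for i, j in combinations(range(len(unique)), 2):
        if unique[i] & unique[j]:  # the two embeddings are not edge-disjoint
            overlap[i].add(j)
            overlap[j].add(i)
    return unique, overlap


# Hypothetical example: two-edge paths embedded in the path 1-2-3-4-5.
# The last embedding is identical to the first (same edges, other order).
embeddings = [
    [(1, 2), (2, 3)],
    [(2, 3), (3, 4)],
    [(3, 4), (4, 5)],
    [(3, 2), (2, 1)],
]
unique, overlap = build_overlap_graph(embeddings)
```

In this toy case the overlap graph is itself a path on three vertices: the middle embedding overlaps both of the others, which are edge-disjoint from each other.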

2.1  Canonical  Labeling 

One of the key operations required by any frequent subgraph discovery
algorithm is a mechanism by which to check whether two subgraphs are
identical or not. One way of performing this check is to perform a
graph isomorphism operation. However, in cases in which many such
checks are required among the same set of subgraphs, a better way of
performing this task is to assign to each graph a unique code (i.e.,
a sequence of bits, a string, or a sequence of numbers) that is
invariant on the ordering of the vertices and edges in the graph.
Such a code is referred to as the canonical label of a graph
$G = (V, E)$ [61, 18], and we will denote it by cl(G). By using
canonical labels, we can check whether or not two graphs are
identical by checking to see whether they have identical canonical
labels. Moreover, by comparing the canonical labels we can obtain a
complete ordering of a set of graphs in a unique and deterministic
way, regardless of the
original vertex and edge ordering.

Table 1: Notation used throughout the paper

Notation                      Description
----------------------------  ----------------------------------------------------
k-subgraph                    A connected subgraph with k edges (also written as
                              a size-k subgraph)
$G^k$, $H^k$                  Graphs of size k
$E(G)$                        Edges of a graph G
$V(G)$                        Vertices of a graph G
$cl(G)$                       Canonical label of a graph G
$dia(G)$                      Diameter of a graph G
$a, b, c, e, f$               Edges
$u, v$                        Vertices
$d(v)$                        Degree of a vertex v
$l(v)$                        Label of a vertex v
$l(e)$                        Label of an edge e
$H = G - e$                   H is a graph obtained by the deletion of edge
                              $e \in E(G)$
Q                             Input graph
$G_i$                         G's connected component
$S(G^{k+1})$                  Set of all connected size-k subgraphs of $G^{k+1}$
$\mathcal{M}(G) = \{m_i\}$    All embeddings of a subgraph G in Q
$\mathcal{A}(G) = \{e_i\}$    All anchor edges of a subgraph G in Q
$C$                           Candidate subgraph
$\mathcal{C}^k$               Set of candidates with k edges
$\mathcal{C}$                 Set of all candidates
$F$                           Frequent subgraph
$\mathcal{F}^k$               Set of frequent k-subgraphs
$\mathcal{F}$                 Set of all frequent subgraphs
$k^*$                         Size of the largest frequent subgraph in Q
$L_E$                         Set of all edge labels in Q
$L_V$                         Set of all vertex labels in Q

A simple way of defining the canonical label of a graph is as the
string obtained by concatenating the upper triangular entries of the
graph's adjacency matrix when this matrix has been symmetrically
permuted so that this string becomes the lexicographically largest
(or smallest) over the strings that can be obtained from all such
permutations. This is illustrated in Figure 1, which shows a graph
$G^3$ and the permutation of its adjacency matrix¹ that leads to its
canonical label "aaazyx". In this code, "aaa" was obtained by
concatenating the vertex labels in the order that they appear in the
adjacency matrix, and "zyx" was obtained by concatenating the columns
of the upper triangular portion of the matrix. Note that any other
permutation of $G^3$'s adjacency matrix will lead to a code that is
lexicographically smaller than (or equal to) "aaazyx". If a graph has
|V| vertices, the complexity of determining its canonical label using
this scheme is O(|V|!), making it impractical even for moderate-size
graphs. Note that the problem of determining the canonical label of a
graph is equivalent to determining isomorphism between graphs,
because if two graphs are isomorphic with each other, their canonical
labels must be identical. Neither canonical labeling nor graph
isomorphism is known to be either in P or NP-complete [18].
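
The brute-force scheme described above can be written down directly. The following Python sketch is an illustration under stated assumptions, not the labeling algorithm used in this paper: it assumes single-character vertex and edge labels, marks a missing edge with the character "0" (which sorts below letters), and takes the lexicographically largest code. The function name `canonical_label` and the input format are hypothetical.

```python
from itertools import permutations


def canonical_label(vertex_labels, edge_labels):
    """Brute-force canonical label; tries all |V|! vertex orderings.

    vertex_labels: list of one-character labels, indexed by vertex ID.
    edge_labels:   dict mapping a vertex pair (u, v) to a one-character label.
    """
    n = len(vertex_labels)

    def entry(u, v):
        # Undirected graph: look the pair up in either orientation.
        return edge_labels.get((u, v)) or edge_labels.get((v, u)) or "0"

    best = None
    for perm in permutations(range(n)):
        # Vertex labels in the permuted order ...
        code = "".join(vertex_labels[v] for v in perm)
        # ... followed by the upper triangle, column by column:
        # (0,1), (0,2), (1,2), (0,3), (1,3), (2,3), ...
        code += "".join(
            entry(perm[i], perm[j]) for j in range(1, n) for i in range(j)
        )
        if best is None or code > best:
            best = code
    return best


# A triangle with edge labels x, y, z under two different vertex numberings,
# and a two-edge path that is not isomorphic to it.
triangle_a = canonical_label(["a", "a", "a"], {(0, 1): "x", (1, 2): "y", (0, 2): "z"})
triangle_b = canonical_label(["a", "a", "a"], {(0, 1): "z", (1, 2): "x", (0, 2): "y"})
path = canonical_label(["a", "a", "a"], {(0, 1): "x", (1, 2): "y"})
```

Both numberings of the triangle produce the same code, while the path produces a different one. The factorial loop is the reason this definition is impractical beyond very small graphs.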

In practice, the complexity of finding a canonical labeling of a
graph can be reduced by using various heuristics to narrow down the
search space or by using alternate canonical label definitions that
take advantage of special properties that may exist in a particular
set of graphs [53, 52, 18]. As part of our earlier research we have
developed such a canonical labeling algorithm, which makes full use
of edge and vertex labels for fast processing and of various vertex
invariants to reduce the complexity of determining the canonical
label of a graph [45, 46]. Our algorithm can compute the canonical
label of graphs containing up to 50 vertices extremely fast and will
be the algorithm used to compute the canonical labels of the
different subgraphs in this paper.

¹ The symbol $v_i$ in the figure is a vertex ID, not a vertex label,
and blank elements in the adjacency matrix mean there is no edge
between the corresponding pair of vertices.

Figure 1: Simple examples of codes and canonical adjacency matrices.
(a) $G^3$, (b) code = aaazxy, (c) code = aaazyx.

2.2  Maximum  Independent  Set 

As discussed later in Section 4, our frequent subgraph discovery
algorithm focuses on finding subgraphs whose embeddings are
edge-disjoint. A critical step in obtaining this set of edge-disjoint
embeddings for a particular subgraph is to find the maximum
independent set of its overlap graph. Given a graph $G = (V, E)$, a
subset of vertices $I \subseteq V$ is called independent if no two
vertices in $I$ are connected by an edge in $E$. An independent set
$I$ is called a maximal independent set if every vertex $v$ in
$V \setminus I$ is connected by an edge in $E$ to some vertex in $I$,
so that no further vertex can be added to $I$. A maximal independent
set $I$ is called a maximum independent set (MIS) if $I$ contains as
many vertices of $V$ as possible.

The problem of finding the MIS of a graph was among the first
problems proved to be NP-complete [19], and it remains so even for
bounded-degree graphs. Moreover, it has been shown that the size of
the MIS cannot be approximated even within a factor of
$|V|^{1-o(1)}$ in polynomial time [17]. However, the importance of
the problem and its applicability to a wide range of domains have
attracted a considerable amount of research. This research has
focused on developing both faster exact algorithms and approximate
algorithms. The fastest exact algorithm to date is the algorithm by
Robson [62], which solves the MIS problem in time $O(1.211^n)$,
making it possible to solve problem instances containing up to around
100 vertices in a reasonable amount of time. In this study, we used a
fast implementation of the exact maximum clique (MC) problem solver
wclique [57] instead of those fast exact MIS algorithms. Because the
MIS problem on a graph $G$ is equivalent to the MC problem on $G$'s
complement graph $\bar{G}$, we can use wclique as a fast exact MIS
algorithm (EMIS). Heuristic algorithms focus on finding maximal
independent sets whose size is bounded in terms of the size of the
optimal solution, and a number of such methods have been developed
[27, 6, 38, 25].
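
The equivalence between the MIS problem on a graph and the MC problem on its complement can be illustrated with a small Python sketch. The exhaustive clique search below is only a stand-in for a real maximum clique solver; it does not reproduce wclique or its interface, and all function names are hypothetical.

```python
from itertools import combinations


def complement(adj):
    """Complement of an undirected graph given as {vertex: set_of_neighbors}."""
    vertices = set(adj)
    return {v: vertices - adj[v] - {v} for v in adj}


def max_clique_bruteforce(adj):
    """Largest clique by exhaustive search (exponential; tiny graphs only)."""
    vertices = list(adj)
    for size in range(len(vertices), 0, -1):
        for subset in combinations(vertices, size):
            if all(v in adj[u] for u, v in combinations(subset, 2)):
                return set(subset)
    return set()


def exact_mis(adj):
    """A maximum independent set of adj is a maximum clique of its complement."""
    return max_clique_bruteforce(complement(adj))


# Overlap graph shaped like a path 0-1-2, and a cycle on five vertices.
path3 = {0: {1}, 1: {0, 2}, 2: {1}}
cycle5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
mis_path = exact_mis(path3)
mis_cycle = exact_mis(cycle5)
```

For the path, the two end vertices form the maximum independent set; for the five-cycle, any maximum independent set has two vertices.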

One of the most widely used heuristics is the greedy algorithm
(GMIS), which selects a vertex of minimum degree, deletes that vertex
and all of its neighbors from the graph, and repeats this process
until the graph becomes empty. A recent detailed analysis of the GMIS
algorithm has shown that it produces reasonably good approximations
of the MIS for bounded- and low-degree graphs [25]. In particular,
for a graph $G$ with maximum degree $\Delta$ and average degree
$\bar{d}$, the size $|I|$ of the MIS satisfies the following:

(2.1)  $$|I| \le \min\left( \frac{\Delta + 2}{3}\,|\mathrm{GMIS}(G)|,\ \frac{\bar{d} + 2}{2}\,|\mathrm{GMIS}(G)| \right)$$

where $|\mathrm{GMIS}(G)|$ is the size of the approximate MIS found
by the GMIS algorithm.

Figure 3: Patterns with the non-monotonic frequency
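
The greedy heuristic described above is short enough to state in code. The following Python sketch is a minimal illustration, assuming the graph is given as a dictionary from each vertex to the set of its neighbors; the function name `greedy_mis` and the tie-breaking rule (smallest vertex ID among minimum-degree vertices) are choices made for this example and are not taken from the paper.

```python
def greedy_mis(adj):
    """Greedy (minimum-degree) approximation of a maximum independent set."""
    # Work on a copy so the caller's graph is left untouched.
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    independent = set()
    while remaining:
        # Select a vertex of minimum degree in the remaining graph.
        v = min(remaining, key=lambda u: (len(remaining[u]), u))
        independent.add(v)
        # Delete that vertex and all of its neighbors.
        removed = remaining[v] | {v}
        for u in removed:
            del remaining[u]
        for nbrs in remaining.values():
            nbrs -= removed
    return independent


# A path on five vertices and a star with center 0 and four leaves.
path5 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
mis_path = greedy_mis(path5)
mis_star = greedy_mis(star)
```

On both toy graphs the greedy choice happens to be optimal; in general the result is only a maximal independent set whose size is a lower bound on the size of the MIS.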

3  Related  Work 

The previous research on finding frequent subgraphs in graph datasets
falls under two categories. The first category contains algorithms
for finding subgraphs that occur multiple times in a single input
graph [71, 28, 21, 65]; these are directly related to the algorithms
presented in this paper. The second category contains algorithms that
find subgraphs that occur frequently across a database of small
graphs [14, 31, 43, 45, 33, 8, 68, 32, 29, 30, 44]. Between these two
classes of algorithms, those developed for the latter problem are in
general more mature, as they have moderate computational requirements
and scale to large datasets.

In the rest of this section, we describe the related research for the
single-graph setting, as it is directly related to the topic of the
paper.

The most well-known algorithm for finding recurring subgraphs in a
single large graph is the SUBDUE system, originally developed in 1994
and improved over the years [28, 10, 12, 11]. SUBDUE is an
approximate algorithm that finds patterns that can compress the
original input graph by substituting those patterns with a single
vertex. To evaluate the extent to which a particular pattern can
compress the original graph, it uses the minimum description length
(MDL) principle, and it employs a heuristic beam search to narrow the
search space. These approximations improve its computational
efficiency, but at the same time they prevent it from finding some
subgraphs that are indeed frequent. GBI [71] is another greedy
heuristic-based algorithm similar to SUBDUE. Ghazizadeh and Chawathe
[21] developed an algorithm called SEuS that uses a data structure
called a summary to construct a lossy compressed representation of
the input graph. This summary is obtained by collapsing together all
the vertices of the input graph that have the same label and is used
to quickly prune infrequent candidates. As the authors indicate, this
summary data structure is useful only when the input graph contains a
relatively small number of frequent subgraphs with high frequency,
and is not effective if there are a large number of frequent
subgraphs with low frequency. Finally, Vanetik, Gudes and Shimony
[65] presented an algorithm for finding all frequently occurring
subgraphs in a single labeled undirected graph, using the maximum
number of edge-disjoint embeddings of a graph as a measure of its
frequency. Each subgraph is represented by its minimum number of
edge-disjoint paths (path number), and the algorithm uses a
level-by-level approach to grow the patterns based on their path
number. Their emphasis is on efficient candidate generation, and no
special attention is paid to frequency counting.

4  Discovering Frequent Patterns in a Single Graph: Problem Definition

A fundamental issue that needs to be considered by any frequent
subgraph discovery problem formulation for the single-graph setting
is how the occurrence frequency of a pattern is counted. In general,
there are two possible methods. According to the first method, two
embeddings of a subgraph are considered different as long as they
differ by at least one edge (i.e., they are non-identical). As a
result, arbitrary overlaps of embeddings of the same subgraph are
allowed. According to the second method, two embeddings are
considered different only if they do not share any edges (i.e., they
are edge-disjoint). These two methods are illustrated in Figure 2. In
this example, there are three possible embeddings of the subgraph
shown in Figure 2(1) in the input graph of Figure 2(2). Two of these
embeddings (Figures 2(3) and (5)) do not share any edges, whereas the
third embedding (Figure 2(4)) shares edges with the other two. Thus,
if we allow overlaps, the frequency of the subgraph is 3, and if we
do not, it is 2.
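
The two counting methods can be contrasted on a toy instance with the same overlap structure as the example above: three embeddings, of which the middle one shares an edge with each of the other two. The Python sketch below is illustrative only; the edge identifiers "e1" to "e4" are hypothetical and do not correspond to the actual edges in Figure 2, and the exhaustive search is practical only for a handful of embeddings.

```python
from itertools import combinations


def count_with_overlaps(embeddings):
    """First method: every non-identical embedding counts once."""
    return len({frozenset(emb) for emb in embeddings})


def count_edge_disjoint(embeddings):
    """Second method: largest set of pairwise edge-disjoint embeddings."""
    unique = list({frozenset(emb) for emb in embeddings})
    for size in range(len(unique), 0, -1):
        for subset in combinations(unique, size):
            if all(a.isdisjoint(b) for a, b in combinations(subset, 2)):
                return size
    return 0


# Each embedding is the set of input-graph edges it uses.
embeddings = [{"e1", "e2"}, {"e2", "e3"}, {"e3", "e4"}]
freq_overlap = count_with_overlaps(embeddings)
freq_disjoint = count_edge_disjoint(embeddings)
```

With overlaps allowed the frequency is 3, and with edge-disjoint counting it is 2, matching the counts discussed for Figure 2.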

These two ways of counting the frequency of a subgraph lead to
problems with dramatically different characteristics. If we allow
arbitrary overlaps between non-identical embeddings, then the
resulting frequency is no longer downward closed (i.e., the frequency
of a subgraph does not monotonically decrease as a function of its
size). This is illustrated in Figure 3. Both G7 and G6 are subgraphs
of Q. Although the smaller subgraph G6 has only one non-identical
embedding, the larger G7 has six non-identical embeddings. On the
other hand, if we determine the frequency of each subgraph by
counting the maximum number of its edge-disjoint embeddings, then the
resulting frequency is downward closed [65].

Being able to take advantage of a frequency counting method that is
downward closed is essential for the computational tractability of
most frequent pattern discovery algorithms. For this reason, our
problem formulation uses edge-disjoint embeddings. Given this, one
way of formulating the frequent subgraph discovery problem for the
single-graph setting is as follows [65]:

Definition  1  (Exact  Discovery  Problem)  Given  an  input 
graph  Q  which  is  undirected  and  labeled,  and  a  parameter 
f,  find  all  connected  undirected  labeled  subgraphs  that  have 
at  least  f  edge-disjoint  embeddings  in  Q. 

Unfortunately, this problem can quite often be intractable. By this
definition, in order to determine if a subgraph is frequent or not,
we need to find whether the overlap graph of its non-identical
embeddings contains an independent set whose size is at least f. When
a subgraph is relatively frequent compared to the frequency threshold
f, by using approximate MIS algorithms we can quickly tell that such
a subgraph is actually frequent. However, in the cases in which the
approximate MIS algorithm does not find a sufficiently large
independent set, the exact MIS needs to be computed before a pattern
can be kept or discarded. Depending on the resulting size of the
maximum independent set, the subgraph will be identified as frequent
or infrequent. Also, if we need not only to find frequent subgraphs
but also to find their exact frequency, then the exact MIS needs to
be computed on the overlap graph of every pattern. In both cases,
because solving the exact MIS problem is NP-complete (see
Section 2.2), the above definition of the frequent subgraph discovery
problem cannot be tractable, even for a relatively simple input
graph.

Figure 2: Overlapped embeddings. (1) Subgraph, (2) Input graph,
(3) Embedding 1, (4) Embedding 2, (5) Embedding 3.

To  make  the  problem  more  practical,  we  propose  two 
alternative  formulations  that  can  find  frequent  subgraphs 
without  solving  the  exact  MIS  problem. 


Definition 2 (Approximate Discovery Problem) Given an input graph Q
which is undirected and labeled, and a parameter f, find as many as
possible of the connected undirected labeled subgraphs that have at
least f edge-disjoint embeddings in Q.


Definition 3 (Upper Bound Discovery Problem) Given an input graph Q
which is undirected and labeled, and a parameter f, find all
connected undirected labeled subgraphs such that an upper bound on
the number of their edge-disjoint embeddings is above the
threshold f.

Essentially, the solutions for those two problems become a subset and
a superset of the solution for Definition 1, respectively. The first
formulation, Definition 2, which asks for a subset of the solution of
Definition 1, requires that the embeddings of each subgraph form an
overlap graph that has an approximate MIS whose size is greater than
or equal to f. The second formulation, Definition 3, which asks for a
superset of the solution of Definition 1, requires that an upper
bound on the size of the exact MIS of this overlap graph is greater
than or equal to f. Note that, as discussed in Section 2.2, such
upper bounds can be easily obtained for the GMIS algorithm as well as
for other approximate algorithms.
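
The relationship between the three formulations can be made concrete with a small frequency test on an overlap graph. The Python sketch below is a hedged illustration, not the code used in this paper. It assumes that the greedy bound of Section 2.2 has the form min((Δ + 2)/3, (d + 2)/2) · |GMIS(G)|, where Δ is the maximum and d the average degree, and it clamps the factor to at least 1 so that the bound stays valid for edgeless graphs. The exact case uses exhaustive search in place of a real solver, and all function names and the `mis_type` strings are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def greedy_mis_size(adj):
    """Size of the independent set found by the minimum-degree greedy heuristic."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    size = 0
    while remaining:
        v = min(remaining, key=lambda u: (len(remaining[u]), u))
        removed = remaining[v] | {v}
        for u in removed:
            del remaining[u]
        for nbrs in remaining.values():
            nbrs -= removed
        size += 1
    return size


def exact_mis_size(adj):
    """Exact MIS size by exhaustive search (exponential; tiny graphs only)."""
    vertices = list(adj)
    for size in range(len(vertices), 0, -1):
        for subset in combinations(vertices, size):
            if all(v not in adj[u] for u, v in combinations(subset, 2)):
                return size
    return 0


def mis_upper_bound(adj):
    """Upper bound on the MIS size derived from the greedy result (assumed form)."""
    if not adj:
        return Fraction(0)
    degrees = [len(nbrs) for nbrs in adj.values()]
    n = len(degrees)
    factor = min(
        Fraction(max(degrees) + 2, 3),           # (max degree + 2) / 3
        Fraction(sum(degrees) + 2 * n, 2 * n),   # (average degree + 2) / 2
    )
    factor = max(Fraction(1), factor)            # guard for edgeless graphs
    return factor * greedy_mis_size(adj)


def is_frequent(adj, f, mis_type):
    """Decide frequency under Definition 1, 2, or 3, selected by mis_type."""
    if mis_type == "exact":
        return exact_mis_size(adj) >= f
    if mis_type == "approximate":
        return greedy_mis_size(adj) >= f
    if mis_type == "upper_bound":
        return mis_upper_bound(adj) >= f
    raise ValueError(f"unknown MIS type: {mis_type}")


# Overlap graph shaped like a path on five vertices; its exact MIS has size 3.
path5 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
```

On this overlap graph with f = 4, the exact and approximate tests reject the pattern while the upper bound test accepts it, which shows why Definition 3 yields a superset of the solution of Definition 1.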


5  Algorithms for Finding Frequent Subgraphs in a Large Graph

We developed two algorithms, called hSiGraM² and vSiGraM, which find
all frequent subgraphs according to Definitions 1-3 described in
Section 4. In both algorithms, the frequent patterns are conceptually
organized in the form of a lattice that is referred to as the lattice
of frequent subgraphs. The kth level of this lattice contains all
frequent subgraphs with k edges (i.e., size-k subgraphs), and a node
at level k representing a subgraph $G^k$ is connected to at most k
nodes at level k - 1, each corresponding to a distinct (i.e.,
non-isomorphic) connected size-(k - 1) subgraph of $G^k$. The goal of
both hSiGraM and vSiGraM is to identify the various nodes of this
lattice and the frequency of the associated subgraphs.

The difference between the two algorithms is the method they use to
discover (i.e., generate) the nodes of the lattice. hSiGraM follows a
horizontal approach and discovers the nodes in a breadth-first
fashion, whereas vSiGraM follows a vertical approach and discovers
the nodes in a depth-first fashion. Both horizontal and vertical
approaches have been previously used to find frequent subgraphs in
the graph-transaction setting [33, 44, 68, 8] and have their origins
in algorithms developed for finding frequent itemsets and sequences
[2, 3, 26, 72].

A detailed description of hSiGraM and vSiGraM is provided in the rest
of this section.

5.1  Horizontal  Algorithm:  hSiGraM 

The general structure of hSiGraM is shown in Algorithm 1 (the
notation used in the pseudo-code is shown in Table 1). hSiGraM takes
as input the graph Q, the minimum frequency threshold f, and the
parameter MIS_type that specifies the particular problem definition
(as discussed in Section 4). It starts by enumerating all frequent
single- and double-edge subgraphs in Q, and then enters its main
computational loop (Lines 7-15). During each iteration, hSiGraM first
generates all candidate subgraphs of size k + 1 by joining pairs of
size-k frequent subgraphs (Line 8) and then computes their frequency
(hSiGraM-Count in Line 11). The candidate subgraphs whose frequency
is lower than the minimum threshold f are discarded, and the
remaining ones are kept for the next level of the algorithm. The
computation terminates when no frequent subgraphs are generated
during a particular iteration.

The two key components of the hSiGraM algorithm that significantly
affect its overall computational complexity are the methods used to
perform candidate generation and to compute the frequency of the
candidate subgraphs. In the rest of this section we provide additional
details on how these operations are performed and describe various
optimizations that are designed to reduce their runtime.

SiGraM stands for Single Graph Miner.


Algorithm 1 hSiGraM(𝒢, MIS_type, f)

 1: ▷ f is the minimum frequency threshold.
 2: ▷ MIS_type is either approximate, exact, or upper bound.
 3: ℱ ← ∅
 4: ℱ^1 ← all frequent size-1 subgraphs in 𝒢
 5: ℱ^2 ← all frequent size-2 subgraphs in 𝒢
 6: k ← 2
 7: while ℱ^k ≠ ∅ do
 8:     𝒞^{k+1} ← hSiGraM-Gen(ℱ^{k-1}, ℱ^k, f)
 9:     ℱ^{k+1} ← ∅
10:     for each candidate C in 𝒞^{k+1} do
11:         C.freq ← hSiGraM-Count(C, MIS_type)
12:         if C.freq ≥ f then
13:             add C to ℱ^{k+1}
14:     ℱ ← ℱ ∪ ℱ^{k+1}
15:     k ← k + 1
16: return ℱ
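The level-by-level loop of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration of the control flow, not the authors' implementation: `generate` and `count` are hypothetical caller-supplied stand-ins for hSiGraM-Gen and hSiGraM-Count, and the toy helpers below mine item sets rather than subgraphs so that the example stays self-contained.

```python
def levelwise_mine(level1, level2, generate, count, min_freq):
    """Level-by-level loop in the shape of Algorithm 1.

    generate(prev, curr) and count(cand) are caller-supplied stand-ins
    (hypothetical names) for hSiGraM-Gen and hSiGraM-Count.
    """
    frequent = []                       # plays the role of F, initially empty
    prev, curr = list(level1), list(level2)
    while curr:                         # stop once a level is empty
        nxt = [c for c in generate(prev, curr) if count(c) >= min_freq]
        frequent.extend(nxt)            # F <- F U F^{k+1}
        prev, curr = curr, nxt          # k <- k + 1
    return frequent


# Toy stand-ins (assumptions for illustration only): "patterns" are item
# sets and the frequency of a pattern is its transaction support.
TRANSACTIONS = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c"}]


def toy_count(pattern):
    return sum(1 for t in TRANSACTIONS if pattern <= t)


def toy_generate(prev, curr):
    # Join two size-k patterns when their union has exactly k + 1 items.
    joined = set()
    for i, p in enumerate(curr):
        for q in curr[i + 1:]:
            if len(p | q) == len(p) + 1:
                joined.add(p | q)
    return joined
```

As in the pseudo-code, the returned collection accumulates only the levels produced inside the loop; the size-1 and size-2 patterns are passed in by the caller.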

5.1.1  Candidate  Generation 

hSiGraM generates candidate subgraphs of size k + 1 by joining two
frequent size-k subgraphs. In order for two such frequent size-k
subgraphs to be eligible for joining, each of the two must contain the
same size-(k − 1) connected subgraph. The simplest way to generate the
complete set of candidate subgraphs is to join all pairs of size-k
frequent subgraphs that have a common size-(k − 1) subgraph.
Unfortunately, the problem with this approach is that a particular
size-k subgraph may have up to k different size-(k − 1) subgraphs. As
a result, if we consider all such possible subgraphs and perform the
resulting join operations, we will end up generating the same
candidate pattern multiple times, and generating a large number of
candidate patterns that are not downward closed. Such an algorithm
would spend a significant amount of time identifying unique candidates
and eliminating non-downward closed candidates (both of which are
non-trivial operations, as they require determining the canonical
label of the generated subgraphs).
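To make the "up to k different size-(k − 1) subgraphs" remark concrete, the following minimal sketch enumerates the connected subgraphs obtained by deleting one edge from a pattern. It assumes a pattern is given as a list of `(u, v)` vertex pairs; the function names are illustrative, and the results are not deduplicated up to isomorphism, which is the step that would require canonical labels.

```python
def is_connected(edges):
    """True if the edges form a single connected component."""
    edges = list(edges)
    if not edges:
        return False
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    start = edges[0][0]
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(adj)


def connected_sub_patterns(edges):
    """All connected edge sets obtained by deleting exactly one edge."""
    edges = list(edges)
    result = []
    for i in range(len(edges)):
        rest = edges[:i] + edges[i + 1:]
        if is_connected(rest):
            result.append(frozenset(rest))
    return result
```

A three-edge path has two such subgraphs (deleting the middle edge disconnects it), whereas a triangle has three, which is the maximum for k = 3.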

hSiGraM addresses both of these problems by only joining two frequent
subgraphs if and only if they share a certain, properly selected,
size-(k − 1) subgraph. Algorithm 2 shows the pseudo-code for the
candidate generation, where the properly selected size-(k − 1)
subgraph is denoted by F. For each frequent size-k subgraph F_i, let
P(F_i) = {H_{i,1}, H_{i,2}} be the two size-(k − 1) connected
subgraphs of F_i such that H_{i,1} has the smallest canonical label
and H_{i,2} has the second smallest canonical label among the various
connected size-(k − 1) subgraphs of F_i. We will refer to these
subgraphs as the primary subgraphs of F_i. Note that if every
size-(k − 1) subgraph of F_i is isomorphic to each other,
H_{i,1} = H_{i,2} and |P(F_i)| = 1. hSiGraM will only join two
frequent subgraphs F_i and F_j if and only if P(F_i) ∩ P(F_j) ≠ ∅, and
the join operation will be done with respect to the common
size-(k − 1) subgraph(s). The proof that this approach will correctly
generate all valid candidate subgraphs is presented in [44]. This
candidate generation approach dramatically reduces the number of
redundant and non-downward closed patterns that are generated and
leads to significant performance improvements over the naive approach
[45].


Algorithm 2 hSiGraM-Gen(ℱ^{k-1}, ℱ^k, f)

 1: 𝒞^{k+1} ← ∅
 2: for each F in ℱ^{k-1} do
 3:     for each pair F_i, F_j in F.children do
 4:         C ← join F_i and F_j based on F
 5:         ▷ test if the downward closure property holds.
 6:         S(C) ← all connected size-k subgraphs of C
 7:         P(C) ← two primary subgraphs of size k
 8:         skip ← false
 9:         for each S in S(C) do
10:             if S.freq < f then
11:                 skip ← true
12:                 break
13:         if skip ≠ true then
14:             add C to 𝒞^{k+1}
15:             ▷ P(C) = {H_1, H_2}
16:             add C to H_1.children and to H_2.children
17: return 𝒞^{k+1}


Algorithm 3 hSiGraM-Count(C^{k+1}, MIS_type)

 1: (M(C^{k+1}), A(C^{k+1})) ← hSiGraM-Embed(C^{k+1}, 𝒢)
 2: G ← build an overlap graph from M(C^{k+1})
 3: {G_1, G_2, ..., G_m} ← decompose G
 4: f_MIS ← 0
 5: for each G_i in {G_1, G_2, ..., G_m} do
 6:     if G_i is easy to handle then
 7:         f_MIS ← f_MIS + |EMIS(G_i)|
 8:     else if MIS_type = approximate then
 9:         f_MIS ← f_MIS + |GMIS(G_i)|
10:     else if MIS_type = exact then
11:         f_MIS ← f_MIS + |EMIS(G_i)|
12:     else if MIS_type = upper bound then
13:         f_MIS ← f_MIS + |GMIS(G_i)| · min((Δ + 2)/3, (d + 2)/2)
14: ▷ S(C^{k+1}) is a set of all connected size-k subgraphs in C^{k+1}
15: f_p ← the lowest frequency among S(C^{k+1})
16: return min(f_MIS, f_p)

5.1.2  Frequency  Counting 

hSiGraM-Count in Algorithm 3 computes the frequency of a candidate
subgraph C by first identifying all of its embeddings, constructing
the overlap graph of these embeddings, and then, based on the MIS_type
parameter, finding an approximate or exact MIS of this overlap graph.
The outline of this process is shown in Algorithms 3 and 4. In the
rest of this section we first describe how the various embeddings are
identified, followed by a description of the method used to
efficiently compute the desired maximal independent sets.
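The construction of the overlap graph can be illustrated with a short sketch. It assumes, in line with the edge-disjoint setting described in the abstract, that two embeddings overlap when they share at least one edge of the input graph. Embeddings are represented as sets of hashable edge identifiers (here sorted vertex pairs), and the function name is illustrative rather than taken from the paper.

```python
from collections import defaultdict


def build_overlap_graph(embeddings):
    """Return {embedding index: set of indices of overlapping embeddings}.

    Two embeddings are joined by an overlap edge when they share at least
    one edge of the input graph (an assumption stated in the lead-in).
    """
    owners_of = defaultdict(list)          # input-graph edge -> embeddings using it
    for idx, emb in enumerate(embeddings):
        for edge in emb:
            owners_of[edge].append(idx)

    adj = {idx: set() for idx in range(len(embeddings))}
    for owners in owners_of.values():
        for a in owners:
            for b in owners:
                if a != b:
                    adj[a].add(b)
    return adj
```

Indexing embeddings by the edges they use avoids comparing every pair of embeddings directly, which matters when most embeddings do not overlap.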

Embedding Identification. In order to identify all the embeddings of a
candidate C, hSiGraM-Embed, shown in Algorithm 4, needs to solve the
subgraph isomorphism problem. Performing the subgraph isomorphism for
every candidate from scratch may be expensive, especially when the
input graph is large. hSiGraM-Embed reduces this computational
requirement by using anchor edges. An anchor edge is a partial
embedding of a candidate C and works as a constraint on the subgraph
isomorphism problem that narrows the search space down to the
neighborhood of the anchor edge.


Algorithm 4 hSiGraM-Embed(C, 𝒢)

 1: ▷ A: a set of all anchor edges of C
 2: A ← intersection of anchor edges across S(C)
 3: ▷ collect all unique embeddings of C into M
 4: M ← ∅
 5: for each anchor edge e in A do
 6:     M_e ← all embeddings of C that include the edge e
 7:     M ← M ∪ M_e
 8: ▷ collect all unique anchor edges of C into A
 9: A ← ∅
10: for each embedding m in M do
11:     e ← choose one edge from m arbitrarily
12:     add e to A
13: return (M, A)

More specifically, hSiGraM-Embed creates and uses anchor edges as
follows. First, the list of anchor edges is created right after
frequency counting for a size-(k − 1) frequent subgraph, by converting
the list of its non-identical embeddings. These edges will be used
later for counting a candidate of size k. Let F_i denote a frequent
subgraph of size k − 1 and suppose F_i has N non-identical embeddings
in total. After the frequency counting, F_i has a list of all its
embeddings M(F_i) = {m_1, ..., m_N}. An anchor edge e of an embedding
m_j of F_i is an edge in E(𝒢) that is also a part of m_j. For every
m_j, hSiGraM-Embed arbitrarily chooses an edge and adds it to A(F_i)
(Line 11 in Algorithm 4). Because of overlapped embeddings, some
embeddings may lead to the same anchor edge.

Now, in the next iteration, suppose a size-k candidate C contains a
frequent (k − 1)-subgraph F_i. Because there are k edges in E(C), C
may have up to k distinct such frequent subgraphs of size k − 1, and
each F_i holds its anchor edge list. Before starting the frequency
counting of C, hSiGraM-Embed first selects the F_i whose frequency is
the lowest among {F_j}. For each e_n ∈ A(F_i), hSiGraM-Embed checks if
there is an edge e_m ∈ A(F_j) for all j ≠ i such that the shortest
path length between e_n and e_m, denoted by d, is within the diameter
of C, denoted by dia(C). If there is such an edge e_m in every A(F_j)
for j ≠ i, e_n may be a part of an embedding of C, because if C is a
frequent subgraph of size k, there must be a set of frequent subgraphs
of size k − 1 inside the same embedding of C. Computing the exact path
length between edges e_n and e_m in 𝒢 requires all-pairs shortest
paths, which may be computationally expensive when |E(𝒢)| is large.
hSiGraM-Embed instead bounds this length d from below by the
difference between two lengths, |d_n − d_m|, where d_n and d_m are the
shortest path lengths from an arbitrarily chosen vertex v ∈ V(𝒢) to
e_n and e_m, respectively. If e_n and e_m are in the same embedding of
C, d ≤ dia(C) always holds, and d_n ≤ d_m + d. Thus, if
|d_n − d_m| ≤ dia(C) is true, then e_n and e_m may belong to the same
embedding of C; otherwise e_n and e_m cannot be in the same embedding
(see Figure 4). If no such e_m can be found in every A(F_j) for j ≠ i,
e_n is removed from A(F_i) (Line 2). Because the subgraph isomorphism
will be performed for each e_n, this pruning procedure can effectively
reduce the run-time.
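The distance-based pruning can be sketched as follows. The sketch makes two assumptions that the text does not spell out: the distance between two edges is taken to be the smallest shortest-path distance between their endpoints, and every vertex is reachable from the chosen reference vertex. Under these assumptions a single breadth-first search yields a valid lower bound on the edge-to-edge distance; all function names are illustrative.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def edge_distance_lower_bound(dist, e1, e2):
    """Lower bound on the endpoint-to-endpoint distance between two edges.

    For vertices a and b, |dist[a] - dist[b]| never exceeds their true
    distance (triangle inequality), so the minimum over endpoint pairs
    never exceeds the true minimum over endpoint pairs.
    """
    return min(abs(dist[a] - dist[b]) for a in e1 for b in e2)


def may_share_embedding(dist, e1, e2, diameter):
    """False only when the two anchor edges are provably too far apart."""
    return edge_distance_lower_bound(dist, e1, e2) <= diameter
```

The test is one-sided: a `True` result does not guarantee that the two edges lie in a common embedding, only that the bound cannot rule it out.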


Figure 4: Distance estimation between two edges

Finally, after removing unnecessary anchor edges, for each of the
remaining anchor edges, all the subgraph isomorphisms of C are
repeatedly identified and the set of embeddings M is built (Line 6).

Computing the Frequency. The frequency of each subgraph C^{k+1} is
computed by the hSiGraM-Count function shown in Algorithm 3. In
particular, hSiGraM-Count computes two different frequencies. The
first, denoted by f_MIS, is computed based on the size of the MIS of
the overlap graph created from the embeddings of C^{k+1}. The second,
denoted by f_p, is the least frequency of all the connected size-k
subgraphs of C^{k+1} (Line 15), which represents an upper bound on the
frequency of C^{k+1} derived entirely from the lattice of frequent
subgraphs. In the case in which f_MIS is computed using Definition 3,
the frequency bound provided by f_p may actually be tighter, and thus
may lead to more effective pruning. For this reason, the overall
frequency of C^{k+1} is obtained by taking the minimum of f_MIS and
f_p.

The frequency f_MIS is computed as follows (Lines 2-13). Given a
pattern and all of its non-identical embeddings, hSiGraM-Count
generates its overlap graph G. Then, hSiGraM-Count decomposes G into
its connected components G_1, G_2, ..., G_m (m ≥ 1). Next, for each
connected component G_i, it checks the maximum degree of its vertices,
and if it is less than or equal to two (a cycle or a path), it
computes its maximum independent set directly by the EMIS algorithm,
because it is trivial to compute the exact MIS for this class of
graphs (Line 7). If the maximum degree is greater than two,
hSiGraM-Count uses either the result of the GMIS algorithm (Line 9),
the result of the EMIS algorithm (Line 11), or the upper bound on the
size of the exact MIS (Line 13, Equation 2.1). The summation of those
MIS sizes for the components is the final value of f_MIS. Note that
the decomposition of the overlap graph into its connected components
allows us to take advantage of the properties of the special graphs
and also obtain tighter bounds for each component, as the maximum
degree for some of them will be lower than the maximum degree of the
entire overlap graph.
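The component-wise computation of f_MIS can be sketched for the approximate setting as follows. Two points are assumptions rather than details taken from the paper: the GMIS step is stood in for by a minimum-degree greedy heuristic, and vertices of the overlap graph are assumed to be mutually comparable (for example integers) so that ties are broken deterministically. The closed forms used for paths and cycles are standard: a path on n vertices has a maximum independent set of size ⌈n/2⌉ and a cycle on n vertices one of size ⌊n/2⌋.

```python
def connected_components(adj):
    """Split an adjacency dict {vertex: set(neighbors)} into components."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        stack, comp = [s], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps


def greedy_mis_size(adj, comp):
    """Min-degree greedy independent set (a stand-in for GMIS)."""
    alive, size = set(comp), 0
    while alive:
        u = min(alive, key=lambda x: (len(adj[x] & alive), x))
        size += 1
        alive -= adj[u] | {u}          # drop the chosen vertex and its neighbors
    return size


def f_mis_approx(adj):
    """Sum of per-component MIS sizes: exact for paths/cycles, greedy otherwise."""
    total = 0
    for comp in connected_components(adj):
        n = len(comp)
        if max(len(adj[u]) for u in comp) <= 2:
            n_edges = sum(len(adj[u]) for u in comp) // 2
            is_cycle = n_edges == n    # connected, max degree 2, as many edges as vertices
            total += n // 2 if is_cycle else (n + 1) // 2
        else:
            total += greedy_mis_size(adj, comp)
    return total
```

For instance, an overlap graph made of a triangle, a three-vertex path, a star with three leaves, and one isolated vertex yields 1 + 2 + 3 + 1 = 7.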

In  addition,  every  edge  is  marked  if  it  is  included  in  any 
embedding  of  a  frequent  subgraph.  Unmarked  edges  are 
removed  before  proceeding  to  the  next  iteration. 

5.2  Vertical  Algorithm:  vSiGraM 

The most computationally expensive step in the hSiGraM algorithm is
frequency counting, as it needs to repeatedly perform subgraph
isomorphism computations. The overall time can be greatly reduced if,
instead of storing only the anchor edges, we store the complete set of
embeddings across successive levels of the algorithm. Because of
hSiGraM's level-by-level structure, these complete embeddings need to
be stored for the entire set of frequent and candidate patterns of
each successive pair of levels. This substantially increases the
memory requirements of this approach, making it impractical for most
interesting datasets. On the other hand, within the context of a
vertical algorithm, storing the complete set of embeddings is feasible
since we need to do that only for the subgraphs along the path from
the current node to the root. Thus, a vertical algorithm potentially
has a computational advantage over a horizontal algorithm, which
motivated the development of vSiGraM.

However, before developing efficient algorithms that generate the
lattice of frequent subgraphs in a depth-first fashion, two critical
steps need to be addressed. The first step is the method that is used
to ensure that the same node of the lattice and the depth-first
subtree rooted at that node are not discovered and explored multiple
times. This is important because each node at level k will be
connected to up to k different nodes at level (k − 1). As a result, if
there are no mechanisms by which to prevent the repeated generation of
the same node, a depth-first algorithm will end up performing
redundant computations (i.e., generating the same nodes multiple
times), adversely impacting the overall performance of the algorithm.
vSiGraM eliminates these redundant computations by assigning each node
at level k (corresponding to a subgraph F^k) to a unique parent node
at level k − 1 (corresponding to a subgraph F^{k-1}), such that only
F^{k-1} is allowed to create F^k. The subgraph F^{k-1} is called the
generating parent of F^k. Details on how this is achieved are provided
in Section 5.2.1.

The second step is the method that is used to create successor nodes
in the course of the traversal. In the case of hSiGraM, this
corresponds to the candidate generation phase, and is performed by
joining the frequent subgraphs of the previous level. However, since
the lattice is explored in a depth-first fashion, such a joining-based
approach will not work, as the algorithm may not have yet discovered
the required frequent subgraphs. To address this problem, vSiGraM
creates the successor nodes (i.e., extended subgraphs) by analyzing
all the embeddings of the current subgraph F^k, and identifying the
distinct one-edge extensions to these embeddings that are sufficiently
frequent. The frequent extensions for which F^k is the generating
parent are then used as the successor nodes during the depth-first
traversal.

The general structure of vSiGraM is shown in Algorithm 5. vSiGraM
starts by determining all frequent size-1 patterns and then uses each
one of them as the starting point of a recursive depth-first extension
(vSiGraM-Extend function). vSiGraM-Extend takes as input a size-k
frequent subgraph F^k and all of its embeddings M(F^k) in 𝒢 and
proceeds as follows. For each size-k embedding m ∈ M(F^k), it
identifies and stores every possible size-(k + 1) subgraph in 𝒢 that
contains m. From this set of subgraphs, it extracts all size-(k + 1)
subgraphs which are not isomorphic to each other and stores them in
𝒞^{k+1}. Then, vSiGraM-Extend eliminates from 𝒞^{k+1} all the
subgraphs that do not have F^k as their generating parent (Lines 5-6)
or are infrequent (Lines 7-9). The subgraphs remaining in 𝒞^{k+1} are
the frequent subgraphs of size (k + 1) obtained by a one-edge
extension of F^k and are used as input for the next recursive call.
The recursion terminates when 𝒞^{k+1} = ∅, and the depth-first search
backtracks.


Algorithm 5 vSiGraM

vSiGraM(𝒢, MIS_type, f)
 1: ℱ ← ∅
 2: ℱ^1 ← all frequent size-1 subgraphs in 𝒢
 3: for each F^1 in ℱ^1 do
 4:     M(F^1) ← all embeddings of F^1
 5: for each F^1 in ℱ^1 do
 6:     ℱ ← ℱ ∪ vSiGraM-Extend(F^1, 𝒢, MIS_type, f)
 7: return ℱ

vSiGraM-Extend(F^k, 𝒢, MIS_type, f)
 1: ℱ ← ∅
 2: for each embedding m in M(F^k) do
 3:     𝒞^{k+1} ← 𝒞^{k+1} ∪ {all (k + 1)-subgraphs of 𝒢 containing m}
 4: for each C^{k+1} in 𝒞^{k+1} do
 5:     if F^k is not the generating parent of C^{k+1} then
 6:         continue
 7:     compute C^{k+1}.freq from M(C^{k+1})
 8:     if C^{k+1}.freq < f then
 9:         continue
10:     add C^{k+1} to ℱ
11: return ℱ
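The enumeration of the size-(k + 1) subgraphs that contain a given embedding can be sketched as follows. The representation is an assumption made for illustration: the input graph is a list of undirected edges written as sorted vertex pairs, and an embedding is a set of such edges. Grouping the results by isomorphism class, which the algorithm needs next, is left out.

```python
def one_edge_extensions(graph_edges, embedding):
    """Every connected (k + 1)-edge subgraph obtained by adding one edge.

    An edge qualifies when it is not already in the embedding and shares
    at least one endpoint with it, so connectivity is preserved.
    """
    emb = frozenset(embedding)
    touched = {v for edge in emb for v in edge}
    result = set()
    for edge in graph_edges:
        if edge in emb:
            continue
        u, v = edge
        if u in touched or v in touched:
            result.add(emb | {edge})
    return result
```

Because only edges incident to the embedding are accepted, the cost per embedding is bounded by the number of edges adjacent to it in a sparse graph when an adjacency index replaces the linear scan used here.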

In  the  rest  of  this  section  we  provide  additional  details  on 
how  the  various  operations  are  performed  and  describe  var¬ 
ious  optimizations  that  are  designed  to  reduce  vSiGraM’s 
run-time. 

5.2.1  Generating  Parent  Identification 

The scheme that vSiGraM uses to determine the generating parent of a
particular subgraph is as follows. Suppose a size-(k + 1) frequent
subgraph F^{k+1} has just been created by extension from a size-k
frequent subgraph F^k. By the canonical labeling, the order of edges
and vertices in F^{k+1} is uniquely determined. vSiGraM removes the
last edge that does not disconnect F^{k+1} and obtains another size-k
subgraph F. If F is isomorphic to F^k, then F^k becomes the generating
parent of F^{k+1}, and vSiGraM continues the exploration from F^{k+1}.
Similar approaches have been used earlier in the context of vertical
algorithms for the graph-transaction setting [65, 68]. All of these
share the same idea, which avoids redundant frequent pattern
generation and traverses the lattice of patterns as if it were a tree.
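The generating-parent test can be illustrated on very small patterns. The canonical labeling scheme itself is not described in this section, so the sketch below substitutes a brute-force canonical form (the smallest renumbered vertex-label and edge list over all vertex permutations); this is an assumption that is only practical for tiny graphs. A pattern is a pair `(vlabels, edges)` with `vlabels` mapping integer vertex IDs to labels and `edges` a list of `(u, v, edge_label)` triples; all names are illustrative.

```python
from itertools import permutations


def canonical_form(vlabels, edges):
    """Smallest (vertex labels, sorted edge list) over all renumberings."""
    vertices = sorted({x for u, v, _ in edges for x in (u, v)})
    best = None
    for perm in permutations(range(len(vertices))):
        new_id = dict(zip(vertices, perm))
        labels = [None] * len(vertices)
        for v in vertices:
            labels[new_id[v]] = vlabels[v]
        renamed = sorted(
            (min(new_id[u], new_id[v]), max(new_id[u], new_id[v]), el)
            for u, v, el in edges
        )
        key = (tuple(labels), tuple(renamed))
        if best is None or key < best:
            best = key
    return best


def is_connected(edges):
    adj = {}
    for u, v, _ in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(adj)


def generating_parent_form(vlabels, edges):
    """Drop the last canonical edge whose removal keeps the pattern connected."""
    labels, ordered = canonical_form(vlabels, edges)
    canon_labels = dict(enumerate(labels))
    for i in range(len(ordered) - 1, -1, -1):
        rest = ordered[:i] + ordered[i + 1:]
        if rest and is_connected(rest):
            return canonical_form(canon_labels, rest)
    return None                         # single-edge patterns have no parent


def is_generating_parent(parent, child):
    return canonical_form(*parent) == generating_parent_form(*child)
```

For a two-edge path whose edges carry the labels "x" and "y", exactly one of the two single-edge patterns passes the test, so the path is generated from only one parent.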

5.2.2  Efficient  Subgraph  Extension 

Starting from a frequent size-k subgraph, vSiGraM obtains the extended
subgraphs of size k + 1 by adding an additional edge (while preserving
connectivity) to all of its possible embeddings. Specifically, for
each embedding m of a frequent size-k subgraph F, vSiGraM enumerates
all the edges that can be added to m to form a size-(k + 1) extended
subgraph. Each of those edges is represented by a tuple of 5 elements
s = (x, y, u, v, e), called a stem, where x and y are the vertex IDs
of the edge in 𝒢, u and v, u < v, are the corresponding vertex IDs in
F, and e is the label of the edge. For u and v, if there is no
corresponding vertex in F, −1 is used to show that it is outside the
subgraph F.

Figure 5: Size-6 graph G, canonical vertex IDs, and canonical
automorphism

However, because of the automorphism of the subgraph F, we cannot use
this stem representation directly. For a particular embedding m of a
frequent subgraph F in 𝒢, there may be more than one vertex mapping of
the subgraph onto the embedding. If we simply used a pair of vertex
IDs of the subgraph to represent a stem, depending on the mapping, the
same edge addition might be considered a different stem, which would
result in the wrong frequency of the subgraph. To avoid this problem,
every time a stem is generated, its representation is normalized as
follows. vSiGraM enumerates all possible automorphisms of F, denoted
by {φ_i}. By an appropriate φ_i we obtain the canonical vertex ID for
every vertex v ∈ V(F). The canonical ID of a vertex v, denoted by
cvid(v), is defined as

\[ \mathrm{cvid}(v) = \min_i \phi_i(v). \]

The automorphism with the least subscript that gives the canonical ID
for v is called the canonical automorphism, denoted by φ*_v:

\[ \phi^*_v = \arg\min_{\phi_i} \phi_i(v), \qquad i < j \ \text{if}\ \phi_i(v) = \phi_j(v). \]

For example, given the size-6 graph G shown in Figure 5(a),
cvid(v_3) = v_1 and φ*_{v_3} = φ_2. Figure 5(b) shows cvid and φ* for
every vertex in G. Note that although φ_3(v_3) is also v_1, because
φ_2 has the smaller subscript, 2, φ*_{v_3} is φ_2. Now for each stem
s = (x, y, u, v, e), φ*(u, v) = (u', v') is defined as follows.

\[ (u', v') = \begin{cases} (\mathrm{cvid}(u),\ \phi^*_u(v)) & \text{if } \mathrm{cvid}(u) < \mathrm{cvid}(v), \\ (\phi^*_v(u),\ \mathrm{cvid}(v)) & \text{otherwise.} \end{cases} \]

Then, stem s is rewritten as (x, y, u', v', e), which is an
automorphism-invariant representation of s and is used by vSiGraM to
properly determine the frequency of size-(k + 1) extended subgraphs.
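The normalization can be tried out on a tiny pattern. The sketch below enumerates automorphisms by brute force over all vertex permutations, ignores vertex and edge labels, and handles only stems whose two endpoints both lie inside the pattern (the −1 case is omitted); these simplifications and the function names are assumptions made for illustration. The subscript order of the automorphisms is taken to be the lexicographic order in which the permutations are generated.

```python
from itertools import permutations


def automorphisms(n, edges):
    """All permutations of 0..n-1 that map the edge set onto itself."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(range(n)):
        if {frozenset((perm[u], perm[v])) for u, v in edges} == edge_set:
            result.append(perm)
    return result


def canonical_ids(autos, n):
    """Return (cvid, star): cvid[v] = min_i phi_i(v), star[v] = least such i."""
    cvid, star = [], []
    for v in range(n):
        images = [phi[v] for phi in autos]
        best = min(images)
        cvid.append(best)
        star.append(images.index(best))   # first automorphism reaching the minimum
    return cvid, star


def normalize_pair(autos, cvid, star, u, v):
    """Automorphism-invariant image (u', v') of a pattern vertex pair."""
    if cvid[u] < cvid[v]:
        return cvid[u], autos[star[u]][v]
    return autos[star[v]][u], cvid[v]
```

On the path 0-1-2 the two end vertices are equivalent, so the pairs (0, 1) and (1, 2) normalize to (0, 1) and (1, 0), that is, to the same unordered pair of canonical IDs.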

Additional Optimization: Keeping Track of Edge Creation Status. Each
frequent subgraph maintains a three-dimensional table, called a
connection table. Each element in the table is denoted by
ct(u', v', e), which shows if it is possible to form an edge between
the vertices u' and v' whose edge label is e. Every time a stem
(x, y, u', v', e) is discarded, the corresponding element in the
connection table is updated to show that it is now impossible to
create an edge with a label e between u' and v'. If ct(u', v', e) is
deactivated for a frequent subgraph of size k, then for any l > k,
there should not be any frequent subgraph of size l that has an edge
between u' and v' with the edge label e. We can reduce the number of
stems to be generated by looking up the connection table during the
stem enumeration phase.

5.2.3  Frequency  Counting 

In the vertical algorithm, when a size-(k + 1) extension is processed,
there is only one size-k frequent subgraph visible, the generating
parent. vSiGraM's frequency counting is similar to hSiGraM-Count,
except for the computation of f_p (see Line 15 in Algorithm 3).
hSiGraM enforces the downward closure property on the frequency of a
size-(k + 1) candidate by using the least frequency of all size-k
subgraphs of the candidate. vSiGraM cannot take the same step because
vSiGraM does not hold all size-k frequent subgraphs at the time a
size-(k + 1) extended subgraph is created. Instead, vSiGraM simply
uses the frequency of the size-k generating parent from which the
current size-(k + 1) extension is obtained. As a result, vSiGraM's
pruning is looser than that of hSiGraM.

6  Experimental  Evaluation 

In  this  section,  we  study  the  performance  of  the  proposed 
algorithms  with  various  parameters  and  real  datasets.  All 
experiments  were  done  on  dual  AMD  Athlon  MP  1800+ 
(1.53  GHz)  machines  with  2  GBytes  main  memory,  running 
the  Linux  operating  system.  All  the  run-times  reported  are  in 
seconds. 

6.1  Datasets 

We used six different datasets, each obtained from a different domain,
to evaluate and compare the performance of hSiGraM and vSiGraM. The
basic characteristics of these datasets are shown in Table 2. Note
that even though some of these graphs consist of multiple connected
components, the hSiGraM and vSiGraM algorithms treat them as one large
graph and discover the frequent patterns according to Definitions 1-3
described in Section 4.

The Aviation and Credit datasets are obtained from [64]. The Aviation
dataset is originally from the Aviation Safety Reporting System
Database and the Credit dataset is from the UCI machine learning
repository [7]. The directed edges in the original graph data were
converted into undirected ones. For the Aviation dataset, we removed
the undirected edges that represent the “near_to” relation between two
vertices, because those edges form cliques, which makes this graph
difficult to mine.

The Citation dataset was created from the citation graph used in KDD
Cup 2003 [37]. Each vertex in this graph corresponds to a document and
each edge corresponds to a citation. Because our algorithms are for
undirected graphs, the direction of these citations was ignored. Since
the original dataset does not have any meaningful label for vertices,
we generated vertex labels as follows. We first used a clustering
algorithm to group the document abstracts into 50 thematically
coherent topics, and then assigned the cluster ID as the label to the
corresponding vertices. For the edges, we used as labels the
difference in the publication year of the two papers. For example, if
two papers were published in 1997 and 2002, an edge is created between
those two document vertices with the label “5”. Finally, because some
of the vertices in the resulting graph had a very high degree (i.e.,
authorities and hubs), we kept only the vertices whose degree was less
than or equal to 15.

The Contact Map dataset is made of 170 proteins from the Protein Data
Bank [5] with pairwise sequence identity lower than 25%. The vertices
in these graphs correspond to the different amino acids and the edges
connect two amino acids if they are either at consecutive sequence
positions or they are in contact in their 3D structure. Two amino
acids are considered to be in contact if the distance between their Cα
atoms is less than 8 Å. Furthermore, while creating the graphs we only
considered non-local contacts, which are defined as the contacts
between amino acids whose sequence separation is at least six amino
acids.
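The contact-map construction described above can be sketched as follows. The 8 Å cutoff and the minimum sequence separation of six come from the text; the edge label strings `"sequence"` and `"contact"`, the function name, and the use of plain coordinate tuples for the Cα positions are assumptions made for illustration.

```python
import math


def contact_map_edges(ca_coords, cutoff=8.0, min_separation=6):
    """Build labeled edges for one protein from its C-alpha coordinates.

    ca_coords[i] is the (x, y, z) position, in angstroms, of residue i.
    """
    n = len(ca_coords)
    edges = []
    # Backbone edges between consecutive sequence positions.
    for i in range(n - 1):
        edges.append((i, i + 1, "sequence"))
    # Non-local contacts: far apart in sequence, close in space.
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(ca_coords[i], ca_coords[j]) < cutoff:
                edges.append((i, j, "contact"))
    return edges
```

Each protein processed this way contributes one connected component to the dataset, with the amino acid type as the vertex label.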

The DTP dataset is a collection of 2,319 chemical compounds randomly
selected from the dataset of 223,644 chemical compounds provided by
the Developmental Therapeutics Program (DTP) at the National Cancer
Institute³. Note that each chemical compound forms a connected
component and there are 2,319 such components in this dataset. Each
vertex corresponds to an atom and its label represents the atom type.
An edge is formed between two vertices if the corresponding two atoms
are connected by a bond. The type of a bond is used as an edge label,
and there are three distinct edge labels.

Finally, the VLSI dataset was obtained from the International
Symposium on Physical Design ’98 (ISPD98) benchmark suite⁴ and
corresponds to the netlist of a real circuit. The netlist was
converted into a graph by first removing any nets that are longer than
four and then using a star-based approach to replace each net (i.e.,
hyperedge) by a set of edges. Note that for this dataset we limited
the size of the largest discovered pattern to five edges. This is
because for the values of the frequency threshold used in our
experiments, the only frequent patterns that contained more than five
edges were paths, and because of the highly connected nature of the
underlying graph, there were a very large number of such paths, making
it hard to find these longer path patterns in a reasonable amount of
time.

6.2  Results 

Table 3 shows the results obtained by the hSiGraM and vSiGraM
algorithms for the different datasets, for a wide range of the minimum
frequency threshold values f, and the three different MIS-based
problem definitions. For each experiment, Table 3 shows the amount of
time (in seconds) required by the particular algorithm, the total
number of patterns that were discovered, and the size of the largest
pattern. Entries in the table marked with “—” represent experiments
that were aborted because of high computational requirements.

³ DTP 2D and 3D Structural Information,
http://dtp.nci.nih.gov/docs/3d_database/structural_information/structural_data.html

⁴ http://vlsicad.cs.ucla.edu/~cheese/ispd98.html


Table 2: Datasets used in the experiments

Dataset        Connected    Vertices     Edges    Vertex    Edge
               Components                         Labels    Labels
Aviation            2703      101185    196964      6173      51
Credit               700       14700     28000        59      20
Citation           16999       29014     42064        50      12
Contact Map          170       33443    224488        21       2
DTP                 2319       41190     86140        58       3
VLSI                2633       12752     23084        23       1

From these results we can see that, as expected, for all datasets and
algorithms, as the value of f decreases, the runtime for finding the
frequent patterns increases. The rate of increase in runtime follows
the corresponding rate of increase in the number of patterns that are
being discovered. Besides that, the results in this table help
illustrate the relation between the two key variables in these
experiments, which are the type of the particular algorithm (hSiGraM
vs. vSiGraM) and the type of frequency calculation (approximate MIS,
exact MIS, or upper bound MIS).

In general, the amount of time required by vSiGraM is smaller than
that required by hSiGraM. In fact, as the value of the frequency
threshold decreases, vSiGraM is up to five times faster than hSiGraM.
This is true across all datasets for the approximate and exact MIS
problem formulations, and for those datasets for which the upper bound
MIS formulation leads to the same number of frequent patterns for both
algorithms. As discussed in Section 5.2, the reason for that
performance advantage is the fact that by keeping track of the
embeddings of the frequent subgraphs along the depth-first path,
vSiGraM spends significantly less time in subgraph isomorphism related
computations than hSiGraM does.

However, for certain datasets, when the upper bound MIS
formulation is used, vSiGraM ends up generating significantly
more patterns than hSiGraM. For example, in the case of the
DTP dataset and f = 20, vSiGraM generates almost 16 times
more patterns than hSiGraM. In such cases, the amount of
time required by vSiGraM is substantially greater than that
required by hSiGraM (32.4 times greater in the DTP example).
The reason is that, because of its depth-first nature,
vSiGraM cannot take advantage of the frequent subgraph
lattice to get a tight upper bound on the frequency of a
subgraph based on the frequency of all of its subgraphs;
it bases its upper bound only on the frequency of the
generating parent. On the other hand, because of its
level-by-level nature, hSiGraM can use the information from
all of a candidate’s sub-patterns, and obtains better upper
bounds (see the discussion in Section 5.1.2).
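
The difference between the two bounds can be sketched as follows. This is a minimal illustration that assumes only the downward-closure property (a candidate cannot be more frequent than any of its subgraphs); it does not reproduce the procedure of Section 5.1.2. The function names, the sub-pattern labels, and the toy frequencies are hypothetical.

```python
def bound_from_all_subpatterns(sub_freqs):
    """Level-by-level style: every sub-pattern of the candidate was counted
    at the previous level, so the bound is the minimum over all of them."""
    return min(sub_freqs.values())


def bound_from_parent(sub_freqs, parent):
    """Depth-first style: only the generating parent has been counted."""
    return sub_freqs[parent]


# Hypothetical frequencies of the sub-patterns of one candidate subgraph.
sub_freqs = {"p1": 40, "p2": 25, "p3": 12}
f = 20  # minimum frequency threshold

h_bound = bound_from_all_subpatterns(sub_freqs)  # uses p1, p2, and p3
v_bound = bound_from_parent(sub_freqs, "p1")     # candidate was grown from p1

h_keeps = h_bound >= f  # the tighter bound prunes the candidate
v_keeps = v_bound >= f  # the looser bound lets the candidate through
```

In this toy case the bound from all sub-patterns is 12, so the candidate is pruned, whereas the parent-only bound is 40, so the candidate is retained and counted as frequent under the upper bound formulation.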

Comparing the different MIS-based problem formulations,
we can see that the one based on the approximate MIS usually
leads to the fastest execution time for both algorithms.
Moreover, for datasets for which the various overlap graphs


Table 3: Run-time in seconds and the number of found frequent patterns for the different datasets.


Aviation

     |           Run-time [sec]            |         Number of Found Patterns          |  Largest Pattern Size
     |      Apprx.       Exact          UB |        Apprx.         Exact            UB |  Apprx.   Exact      UB
   f |     H     V     H     V     H     V |      H      V      H      V      H      V |   H   V   H   V   H   V
2000 |   308   130   306   130   320   130 |    833    833    833    833    833    833 |   8   8   8   8   8   8
1750 |   779   342   787   342   789   341 |   2249   2249   2249   2249   2249   2249 |   9   9   9   9   9   9
1500 |  1603   743  1674   745  1584   739 |   5207   5207   5207   5207   5207   5207 |  10  10  10  10  10  10
1250 |  2726  1461  2720  1496  2781  1486 |  11087  11087  11087  11087  11087  11087 |  12  12  12  12  12  12
1000 |  5256  3667  5158  3683  5596  3818 |  30331  30331  30331  30331  30331  30331 |  13  13  13  13  13  13

Citation

     |           Run-time [sec]            |         Number of Found Patterns          |  Largest Pattern Size
     |      Apprx.       Exact          UB |        Apprx.         Exact            UB |  Apprx.   Exact      UB
   f |     H     V     H     V     H     V |      H      V      H      V      H      V |   H   V   H   V   H   V
 100 |   0.1   0.0   0.1   0.0   0.1   0.0 |      6      6      6      6      7     11 |   1   1   1   1   2   b
  50 |   0.1   0.1   0.1   0.1   0.6     - |     39     39     39     39    113      - |   2   2   2   2   7   -
  20 |   0.6   0.3   0.9   0.5   139     - |    266    266    266    266  12203      - |   3   3   3   3  16   -
  10 |   4.0   1.5   4.2   1.9     -     - |    986    986    988    988      -      - |   5   5   5   5   -   -

Contact Map

     |           Run-time [sec]            |         Number of Found Patterns          |  Largest Pattern Size
     |      Apprx.       Exact          UB |        Apprx.         Exact            UB |  Apprx.   Exact      UB
   f |     H     V     H     V     H     V |      H      V      H      V      H      V |   H   V   H   V   H   V
 400 |     3     2     3     2    10     - |    100    100    100    100    246      - |   2   2   2   2   8   -
 300 |    10     3    10     3   183     - |    186    186    186    186   2358      - |   2   2   2   2  10   -
 200 |    44     9    45     9     -     - |    505    505    505    505      -      - |   3   3   3   3   -   -
 100 |   362    63   356    71     -     - |   3183   3183   3186   3186      -      - |   5   5   5   5   -   -
  50 |  3505   607  3532   632     -     - |  29237  29237  29298  29298      -      - |   6   6   6   6   -   -

Credit

     |           Run-time [sec]            |         Number of Found Patterns          |  Largest Pattern Size
     |      Apprx.       Exact          UB |        Apprx.         Exact            UB |  Apprx.   Exact      UB
   f |     H     V     H     V     H     V |      H      V      H      V      H      V |   H   V   H   V   H   V
 500 |     0     0     0     0     0     0 |     24     24     24     24     24     24 |   3   3   3   3   3   3
 200 |    10     4    10     4     9     4 |   1325   1325   1325   1325   1325   1325 |   7   7   7   7   7   7
 100 |    49    20    45    21    45    20 |  11696  11696  11696  11696  11696  11696 |   9   9   9   9   9   9
  50 |   169    78   172    80   169    78 |  73992  73992  73992  73992  73992  73992 |  11  11  11  11  11  11
  20 |  2019   461  1855   468  1880   462 | 613884 613884 613884 613884 613884 613884 |  13  13  13  13  13  13

DTP

     |           Run-time [sec]            |         Number of Found Patterns          |  Largest Pattern Size
     |      Apprx.       Exact          UB |        Apprx.         Exact            UB |  Apprx.   Exact      UB
   f |     H     V     H     V     H     V |      H      V      H      V      H      V |   H   V   H   V   H   V
 500 |    92    20    86    21    96    30 |    109    109    109    109    153    226 |   7   7   7   7  12  13
 200 |   101    23   100    24   115    38 |    414    414    415    415    641    916 |   9   9   9   9  15  15
 100 |   113    27   114    27   169    64 |   1244   1244   1244   1244   2484   3788 |  12  12  12  12  16  18
  50 |   145    34   134    35   247   103 |   4028   4028   4028   4028   8295  13622 |  14  14  14  14  18  21
  20 |   243    86   249    83   616 19998 |  21477  21477  21478  21478  52180 824702 |  16  16  16  16  20  81
  10 |   813   311   882   294  2018     - | 112535 112535 112539 112539 232810      - |  21  21  21  21  21   -

VLSI

     |           Run-time [sec]            |         Number of Found Patterns          |  Largest Pattern Size
     |      Apprx.       Exact          UB |        Apprx.         Exact            UB |  Apprx.   Exact      UB
   f |     H     V     H     V     H     V |      H      V      H      V      H      V |   H   V   H   V   H   V
 200 |    11     3     -     -    37     8 |    137    137      -      -    347    415 |   5   5   -   -   5   b
 150 |    13     4     -     -    46     9 |    156    156      -      -    437    503 |   5   5   -   -   5   5
 100 |    42     7     -     -    54    10 |    379    379      -      -    519    609 |   5   5   -   -   5   5
  75 |    49     8     -     -    56    10 |    409    409      -      -    571    679 |   5   5   -   -   5   5
  50 |   236    15     -     -   282    17 |    683    683      -      -    946   1051 |   5   5   -   -   5   5
  25 |   428    18     -     -   469    20 |   1452   1452      -      -   1907   2131 |   5   5   -   -   5   5

Note. Dashes indicate that the computation was aborted because of excessive run-time or memory exhaustion.

f: the minimum frequency threshold, H: hSiGraM, V: vSiGraM, Apprx.: with approximate MIS, Exact: with exact MIS, UB: with upper bound MIS


are reasonably small (this is true for all our datasets except
VLSI), the exact MIS-based formulation leads to short
execution times as well. Also, the upper bound MIS formulation
tends to be slower than the other two, primarily because it
generates more patterns. However, the advantage of the upper
bound formulation over the one based on the exact MIS
can be seen for the VLSI graph, for which the resulting
overlap graph was large and the exact MIS computations could
not finish in a reasonable amount of time. Finally, comparing
the number of patterns found by the approximate and the exact
MIS-based formulations, we can see that, in general, the
approximate algorithm fails to discover only a very small
number of patterns.
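
The gap between the approximate and the exact counts can be illustrated on a toy overlap graph, in which each vertex is one embedding and two vertices are adjacent when the corresponding embeddings share an edge. The sketch below assumes that embeddings are given as sets of edge identifiers. The first-fit greedy heuristic and the exhaustive search are stand-ins chosen for brevity, not necessarily the procedures used in our experiments, and all names are hypothetical.

```python
from itertools import combinations


def overlap_graph(embeddings):
    """Vertex i is embedding i; i and j are adjacent if they share an edge."""
    adj = {i: set() for i in range(len(embeddings))}
    for i, j in combinations(range(len(embeddings)), 2):
        if embeddings[i] & embeddings[j]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def greedy_mis_size(adj):
    """Approximate MIS: scan vertices in order and keep each vertex that is
    not adjacent to a vertex kept earlier (a maximal independent set)."""
    chosen = set()
    for v in adj:
        if not (adj[v] & chosen):
            chosen.add(v)
    return len(chosen)


def exact_mis_size(adj):
    """Exact MIS by exhaustive search; exponential, for tiny graphs only."""
    vertices = list(adj)
    for k in range(len(vertices), 0, -1):
        for subset in combinations(vertices, k):
            if all(b not in adj[a] for a, b in combinations(subset, 2)):
                return k
    return 0


# Input graph: a path with edges e1-e2-e3-e4. Pattern: a path of two edges.
# Its three embeddings, with the middle one listed first.
embeddings = [
    frozenset({"e2", "e3"}),
    frozenset({"e1", "e2"}),
    frozenset({"e3", "e4"}),
]

adj = overlap_graph(embeddings)
approx = greedy_mis_size(adj)  # picks the middle embedding, blocking both others
exact = exact_mis_size(adj)    # the two outer embeddings are edge-disjoint
```

Here the greedy count is 1 and the exact count is 2. With a threshold of f = 2, this pattern would be reported as frequent under the exact count but missed under the greedy one, which is consistent with the small number of patterns that the approximate formulation does not discover.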


Table 4: SUBDUE Results

Dataset      Run-time [sec]  Number of Patterns  Pattern Size   Frequency of Found Patterns
Aviation                  -                   -  -              -
Citation               8812                   3  27, 26, 27     1, 1, 1
Contact Map            5043                   3  224, 223, 223  1, 1, 1
Credit                  517                   3  6, 5, 5        341, 395, 387
DTP                    1525                   3  2, 2, 6        4957, 4807, 1950
VLSI                     16                   3  1, 1, 1        773, 773, 244


6.3  Performance  Comparison  with  Existing  Algorithms 

Comparison with SUBDUE. We ran SUBDUE [28] version
5.0.6⁵ on the same datasets described in Section 6.1 and


5 Although this version is not the latest one, it runs significantly faster
than the current latest version, 5.0.8.


measured the run-time, the number of discovered patterns,
their size, and their frequency. These results are shown in
Table 4. They were obtained using SUBDUE’s default settings
for all but the VLSI dataset. For the VLSI dataset, we ran
SUBDUE so that it finds subgraphs that contain at most five
edges, as was done in the case of hSiGraM and vSiGraM. Note
that SUBDUE’s default settings return at most three subgraphs,
namely those that it determines to be the most important.

Because of the inherent differences between SUBDUE and
our algorithms, it is impossible to perform a direct comparison
of the results that they generate. For this reason, our
comparisons focus mostly on highlighting some key points.
First, the amount of time required by SUBDUE is, in general,
considerably higher than that required by our algorithms.
For example, SUBDUE did not finish the computation for the
Aviation dataset after four entire days. Also, for the
Citation and Contact Map datasets, SUBDUE could not find
any meaningful patterns at all, as the patterns that it found
had a frequency of one. For the Credit dataset with the
minimum frequency threshold of 50, hSiGraM and vSiGraM
with upper bound MIS spent 169 and 78 seconds, respectively,
to discover the same number of subgraphs, 73,992.
The largest pattern had 11 edges and a frequency of 58. In
contrast, the largest pattern found by SUBDUE had six edges
with a frequency of 341. This indicates that if there are small
subgraphs that have relatively high frequency, SUBDUE will
focus on them and will not discover the larger patterns. We
can see a similar result for the DTP dataset. The patterns
that SUBDUE found are very small, 2-6 edges, but
their frequency is very high. On the other hand, the results
in Table 3 show that with the minimum frequency threshold
of 20, hSiGraM and vSiGraM under exact MIS spend
249 and 83 seconds, respectively, to find 21,478 frequent
subgraphs, the largest of which has 16 edges.

Comparison with SEuS. The SEuS [21] algorithm is designed
to find all frequent subgraphs in a single-graph setting.
However, when determining the frequency of a subgraph, it
considers all embeddings irrespective of whether
they are disjoint or not. As a result, a subgraph may have a high
frequency even though it has a small number of edge-disjoint
embeddings, because of overlapping embeddings. In [21], the
run-time of SEuS on the PTE chemical dataset⁶ is reported.
SEuS (SEuS-Sl) spent more than 20 seconds to find 34 frequent
subgraphs, that is, 1.4 frequent subgraphs per second.
On the same dataset, given the minimum frequency threshold
of 500, vSiGraM with upper bound MIS requires 20
seconds to find 168 frequent subgraphs, which translates to
8.4 frequent subgraphs per second. Similarly, with the Credit
dataset (which is called “Credit-4” in [20]), SEuS-Sl spent
50 seconds to produce 48 frequent subgraphs (about one frequent
subgraph per second), while vSiGraM with upper bound
MIS finds 1,325 frequent subgraphs in four seconds for the
minimum frequency threshold of 200 (331 frequent subgraphs

6 ftp://ftp.comlab.ox.ac.uk/pub/Packages/ILP/Datasets/carcinogenesis/progol/carcinogenesis.tar.Z


per  second). 
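
The effect of counting overlapping embeddings can be seen on a toy example. The sketch below uses an unlabeled star graph with six spokes and a two-edge path as the pattern; both the example and the helper names are hypothetical, and the exhaustive search is meant for toy sizes only.

```python
from itertools import combinations


def two_edge_path_embeddings(edges):
    """All pairs of distinct edges that share exactly one endpoint."""
    return [
        frozenset((e1, e2))
        for e1, e2 in combinations(edges, 2)
        if len(set(e1) & set(e2)) == 1
    ]


def max_edge_disjoint(embeddings):
    """Largest number of pairwise edge-disjoint embeddings (exhaustive)."""
    for k in range(len(embeddings), 0, -1):
        for subset in combinations(embeddings, k):
            if all(not (a & b) for a, b in combinations(subset, 2)):
                return k
    return 0


# Star graph: hub "c" joined to six leaves v0..v5.
star = [("c", f"v{i}") for i in range(6)]

embs = two_edge_path_embeddings(star)
total = len(embs)                   # every pair of spokes is an embedding
disjoint = max_edge_disjoint(embs)  # each spoke can be used at most once
```

Counting all embeddings gives 15, whereas at most three embeddings are pairwise edge-disjoint, so the two frequency definitions can differ substantially even on a very small graph.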

7  Conclusions 

In  this  paper  we  addressed  the  problem  of  finding  all  the 
subgraphs  that  have  many  edge-disjoint  embeddings  in  a 
large  sparse  graph,  a  step  critical  to  discovering  patterns 
in  graph  datasets.  We  studied  three  distinct  formulations 
of  the  problem  that  were  motivated  by  the  complexity  of 
identifying  the  maximum  set  of  edge-disjoint  embeddings  of 
a  subgraph,  and  developed  two  frequent  subgraph  mining 
algorithms  for  solving  them.  These  algorithms  are  based 
on  the  horizontal  and  vertical  paradigms,  respectively.  Our 
experimental  evaluation  on  many  real  datasets  showed  that 
for  most  datasets  and  problem  formulations  both  algorithms 
achieve  good  performance,  with  the  vertical  algorithm  being 
two to five times faster.